Welcome to the Reproducible Data Science unit BIOL33031.
In this unit, you will acquire the skills necessary to engage in reproducible data science. You will learn how to wrangle data, create data visualizations, and model your data using the open-source data science software, R. Each session will be conducted as a combined seminar and hands-on coding workshop. You will discover how to employ a reproducible workflow to generate analyses that can be replicated. Additionally, you will gain proficiency in general computational skills such as using git and GitHub for version control, as well as Binder for constructing reproducible computational environments. Data science skills, especially proficiency in R, are highly sought after by employers in academia, industry, and business sectors. This unit will equip you with a foundational understanding of data science using R, setting the stage for further specialization, such as machine learning with R.
This unit will be run via a flipped classroom model. This means you need to go through each of the workshops before the associated live computer cluster session, so make sure you’ve viewed all of the week’s content before our first scheduled computer cluster session. You will have an opportunity to work on the activities and/or ask questions about each workshop during the live computer cluster sessions. All the videos in these workshops are best viewed in full-screen mode at 1080p resolution. The audio was recorded with a podcasting microphone, so the videos are best listened to with headphones. YouTube generates subtitles automatically, so please turn those on if you’d find them useful.
This unit will consist of workshops spread across 6 weeks. Each workshop will involve a mix of seminar and hands-on programming. The workshops are as follows:
Workshop: Open Research and Reproducibility
Workshop: Getting Started with R
Workshop: Data Wrangling
Workshop: Summarising your data
Workshop: Data visualisation
Workshop: Regression Part 1
Workshop: Regression Part 2
Workshop: ANOVA Part 1
Workshop: ANOVA Part 2
Workshop: Mixed-effects models Part 1
Workshop: Mixed-effects models Part 2
Workshop: Introduction to R Markdown
Workshop: Experimental Power
Workshop: Open source software
These materials are listed as resources to consult if you need further help understanding the material. It is not expected that you read them chapter-by-chapter.
During the hands-on coding sessions, students will receive formative feedback associated with each of the practical problems they work on. There is one assignment associated with this unit. The full details of this (plus the hand-in date) will be posted on Blackboard.
The unit aims to increase your understanding of the following:
Knowledge and understanding
Intellectual skills
Practical skills
Transferable skills and personal qualities
There are two parts to this week’s workshop. The first is on Open Research and Reproducibility, the second on Starting with R and RStudio Desktop. Please make sure you complete both parts before the first timetabled session.
In this first part, we will explore the key concepts in open research, and talk about the so-called “replication crisis” in biological, biomedical, psychological, and life sciences research that has resulted in the Open Research movement. We will also discuss the importance of adopting reproducible research practices in your own research, and provide an introduction to various tools and processes you can incorporate into your own research workflows that will allow you to conduct reproducible research.
Workshop: Open Research and Reproducibility
In the second part, you will be introduced to R (the language) and RStudio Desktop (the environment we use to interact with the language). There is also a link to a great talk by the founder of RStudio, J.J. Allaire. At the end there is a video which will show you how to run your first R script.
Workshop: Getting Started with R
There are three parts to this week’s workshop. The first part is on Data Wrangling, the second on Summarising Your Data, and the third on Data Visualisation.
The first part will introduce you to a number of key packages known as the tidyverse. These packages contain a large number of functions for working with data in tidy format. By making our data wrangling reproducible (i.e., by coding it in R), we can easily re-run this stage of our analysis pipeline as new data gets added. The data wrangling stage is a key part of the analysis process, and the need to make it reproducible is often overlooked. To go to this first part, just click on the image below.
Once you have completed this first part and have your R script up and running, click on the image below for the second part where you’ll learn how to aggregate and summarise your data.
Workshop: Summarising your data
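As a minimal sketch of the kind of wrangling and summarising described above, the example below filters, groups, and summarises data using {dplyr}, one of the tidyverse packages. The built-in mtcars dataset and the column names are illustrative assumptions and are not taken from the workshop materials.

```r
# Illustrative only: mtcars is a built-in R dataset, not the workshop data
library(dplyr)

summary_by_cyl <- mtcars %>%
  filter(!is.na(mpg)) %>%   # keep only rows with a recorded fuel economy
  group_by(cyl) %>%         # one group per number of cylinders
  summarise(
    mean_mpg = mean(mpg),   # average miles per gallon in each group
    sd_mpg   = sd(mpg),     # spread of miles per gallon in each group
    n        = n()          # number of cars in each group
  )

print(summary_by_cyl)
```

Because every step lives in the script, the same summary can be regenerated by re-running it whenever new rows are added to the data.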
In part three, we will explore the basics of Data Visualisation using R. You’ll have the opportunity to write an R script on your own computer that will generate some nice data visualisations. Just click on the image below to start.
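For a flavour of what a visualisation script can look like, here is a short sketch using {ggplot2} (part of the tidyverse). The choice of the built-in mtcars dataset, the variables plotted, and the theme are assumptions made for illustration only.

```r
# Illustrative only: a scatterplot built from the built-in mtcars dataset
library(ggplot2)

scatter_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                       # one point per car
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    title = "Fuel economy by car weight"
  ) +
  theme_minimal()

print(scatter_plot)                    # draws the plot on the active graphics device
```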
There are two parts to this week’s workshop. In the first part we will examine Simple Linear Regression (where you build a linear model to predict an outcome variable on the basis of a predictor variable). In the second part we will move on to Multiple Linear Regression (where you build a linear model to predict an outcome variable on the basis of a number of predictor variables).
In part one we will explore Simple Regression in the context of the General Linear Model (GLM). You will have the opportunity to build some regression models where you predict an outcome variable on the basis of one predictor, and you will learn how to run model diagnostics to ensure you are not violating any key assumptions of regression.
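A simple regression with one predictor, followed by basic diagnostics, might be sketched as below. This uses only base R and the built-in mtcars dataset. The variables are an illustrative assumption, and the workshop itself may use different data and diagnostic tools.

```r
# Illustrative only: predict fuel economy (mpg) from car weight (wt)
simple_model <- lm(mpg ~ wt, data = mtcars)

# Coefficients, standard errors, and R-squared
print(summary(simple_model))

# Standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(simple_model)
par(mfrow = c(1, 1))
```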
In part two we will explore Multiple Regression in the context of the General Linear Model (GLM). Multiple Regression builds on Simple Regression, except that instead of having one predictor (as is the case with Simple Regression) we will be dealing with multiple predictors. Again, you will have the opportunity to build some regression models and use various methods to decide which one is ‘best’. You will also learn how to run model diagnostics for these models as you did in the case of Simple Regression.
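One way to compare candidate models is sketched below: two nested models are fitted and then compared with an F-test and with AIC. The dataset (built-in mtcars), the predictors, and the choice of these two comparison methods are assumptions for illustration. The workshop may use other criteria for deciding which model is ‘best’.

```r
# Illustrative only: compare a one-predictor and a two-predictor model
model_one_predictor  <- lm(mpg ~ wt, data = mtcars)
model_two_predictors <- lm(mpg ~ wt + hp, data = mtcars)

# F-test: does adding hp significantly reduce the residual variance?
comparison <- anova(model_one_predictor, model_two_predictors)
print(comparison)

# AIC: lower values indicate a better trade-off between fit and complexity
print(AIC(model_one_predictor, model_two_predictors))
```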
This week we will cover Analysis of Variance (ANOVA), Analysis of Covariance (ANCOVA), and show how they are both special cases of the General Linear Model.
In part one we will explore Analysis of Variance (ANOVA) in the context of model building in R for between-participants designs, repeated-measures designs, and factorial designs. You will learn how to use the {afex} package for building models with Type III Sums of Squares, and the {emmeans} package to conduct follow-up tests to explore main effects and interactions.
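The following is a rough sketch of a one-way between-participants ANOVA with {afex}, followed by pairwise comparisons with {emmeans}. The data are simulated, and the column names (participant, condition, score) are hypothetical and chosen purely for illustration.

```r
# Illustrative only: simulated data with hypothetical column names
library(afex)
library(emmeans)

set.seed(42)
sim_data <- data.frame(
  participant = factor(1:30),
  condition   = factor(rep(c("low", "medium", "high"), each = 10)),
  score       = c(rnorm(10, mean = 10, sd = 2),
                  rnorm(10, mean = 12, sd = 2),
                  rnorm(10, mean = 15, sd = 2))
)

# aov_4() uses Type III sums of squares by default;
# (1 | participant) identifies the participant ID column
anova_model <- aov_4(score ~ condition + (1 | participant), data = sim_data)
print(anova(anova_model))

# Follow-up pairwise comparisons with a Bonferroni adjustment
print(emmeans(anova_model, pairwise ~ condition, adjust = "bonferroni"))
```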
In part two we will explore Analysis of Covariance (ANCOVA). We will also examine ANOVA and ANCOVA as special cases of regression and see how we can build both via a linear model. By then doing this yourselves, you will hopefully be convinced that ANOVA and regression are really the same thing.
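The equivalence of ANOVA and regression can be checked directly. The sketch below fits the same one-way design once with aov() and once with lm(), using base R and the built-in PlantGrowth dataset (an illustrative choice, not the workshop data), and confirms that both give the same F statistic.

```r
# Illustrative only: the same model fitted as an ANOVA and as a regression
anova_fit      <- aov(weight ~ group, data = PlantGrowth)
regression_fit <- lm(weight ~ group, data = PlantGrowth)

# Extract the F statistic for the group effect from each fit
f_from_anova      <- summary(anova_fit)[[1]][["F value"]][1]
f_from_regression <- anova(regression_fit)[["F value"]][1]

print(c(anova = f_from_anova, regression = f_from_regression))
print(all.equal(f_from_anova, f_from_regression))
```

Under the hood, lm() codes the three-level factor as dummy variables, which is why the two approaches partition the variance identically.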
There are three parts to this week’s content. The first two parts focus on Mixed Models, while the third focuses on using R Markdown. Your assignment for this unit needs to be created using R Markdown. You will then knit your R Markdown file to generate a .html file. It is this .html file that you need to submit as your assignment.
In this first part we will see how mixed models combine aspects of linear regression (for model fitting) while circumventing the need for observations to be independent of each other. We will also examine how we model the influence of random effects in our mixed models, and see how mixed models can cope with unbalanced designs and missing data.
Workshop: Mixed-effects models Part 1
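As a minimal sketch of a mixed model with a random effect, the example below uses lmer() from the {lme4} package with the sleepstudy dataset that ships with it. Both the dataset and the random-intercepts-only structure are assumptions for illustration. Each subject contributes several observations, so the observations are not independent, and the random intercept accounts for that.

```r
# Illustrative only: sleepstudy is an example dataset bundled with lme4
library(lme4)

# Fixed effect of Days on reaction time, with a random intercept
# for each Subject to model repeated measurements on the same person
mixed_model <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

print(summary(mixed_model))
```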
In the second part we will examine mixed models for factorial designs, and explore how to model non-continuous dependent variables (e.g., binary and ordinal outcome variables) using the glmer() family of mixed models.
Workshop: Mixed-effects models Part 2
In the final part we will generate a report in .html format using R Markdown. Reports written using R Markdown allow you to combine narrative that you’ve written alongside R code chunks, and the output associated with those code chunks all in one knitted document. The assignment for this unit needs to be produced using R Markdown and you need to submit the .html file you generate using R Markdown via Blackboard.
Workshop: Introduction to R Markdown
This workshop covers experimental power (and why it is important). One of the insights revealed by the “replication crisis” is that very often research is underpowered for the effect size of interest (i.e., even if the effect is there, your experiment is unlikely to find it). Many of the issues stem from researchers not spending sufficient time considering the power aspects of their research design. In this workshop, we will look at an overview of some of the issues. Just click on the image below to view it.
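To make the idea of an underpowered study concrete, here is a small sketch using power.t.test() from base R. The effect size (a difference of 0.5 standard deviations), the sample size of 20 per group, and the 80% power target are illustrative assumptions, and the workshop may approach power analysis differently (for example, via simulation).

```r
# Illustrative only: power for a two-sample t-test with an assumed effect size

# How many participants per group are needed for 80% power?
required_n <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
print(required_n)

# What power does a study with only 20 participants per group have?
small_study <- power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)
print(small_study$power)
```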
This workshop provides a very brief overview of open source software, the use of which is arguably key for researchers to be able to adopt open and reproducible research workflows. To view this part, just click on the image below.
Workshop: Open source software
Original material: The materials in these workshops were originally created using open source software where possible, on an Entroware Apollo laptop running the GNU/Linux distro Ubuntu 20.04 LTS (Focal Fossa). The audio was captured with a Fifine USB Podcasting microphone and the video with a Razer Kiyo webcam. The audio and video were recorded using Open Broadcaster Software (OBS) and edited using Shotcut. The R code was written using R 3.6.3, and run in the RStudio Desktop IDE version 1.3.959. Ubuntu 20.04 LTS (Focal Fossa), OBS, Shotcut, R, and RStudio Desktop are all open source.
The structure for this unit was inspired by the Sharing At Short Notice webinar by Alison Hill and Desirée De Leon.
The repo for each workshop can be accessed via the ‘Improve this Workshop’ link at the bottom of each workshop page. The workshops and this website were all written using R Markdown and the website is hosted on GitHub Pages via deployment from this GitHub repository, which was forked from the original repository.
The source code for each of the Workshops above is licensed under the MIT license, and the lecture content under CC-BY.
Updates in 2023: The unit was updated using Ubuntu 20.04.5 LTS (Focal Fossa), including R 4.2.2, gedit (3.36.2) and various Unix utilities (grep, sed) run using BASH.